Description



[ULTRASONIC BEAMFORMER AND CORRELATOR]

Background of Invention 
[0001] Field of the Invention

[0002] This invention relates to a printed circuit board that
performs as an ultrasonic beamformer and a correlator and
that can be used alone or in multiples for medical or sonar
imaging. The advances of the invention come from
implementing the algorithms in hardware.

[0003] Description of Related Art 

[0004] A real-time correlator requires many fast multipliers
working in parallel to obtain, preferably, a normalized Sum
of the Products (SOP) at the same speed as the beamformer
output clock (40-60 MHz). Winprobe has been designing
correlators with Xilinx FPGAs, making each fast 16-bit
multiplier out of 25,000 gates. In December of 2000, Xilinx
Corporation made available the Virtex II series of FPGA
chips that contain up to 10 million gates and 200 hardware
18x18 bit fast multipliers. The real benefit of these chips
is the interconnectivity, so signals can be directed to
various filters without limitation or delay jitter. The 2-6
million gate chips chosen by Winprobe for this design can
have their 10 million bits of configuration programming
loaded from a PC through a USB2 interface in two seconds
in the field.

[0005] This project will start with the current board that contains
32 independently programmable transmit and receive
channels, all support circuitry and two multi-million gate
FPGAs. They are nominally designated as the
beamformer/scan converter FPGA and the Correlator FPGA, but
they can be programmed with any combination of tasks. Ten
boards will be required to implement a 320-channel system
(5 elevations of 64 channel phased arrays). The board is
capable of producing 3 simultaneous receive lines from
each transmit beam. The Correlator is capable of collecting
data from the other boards to produce, in concert, 9
simultaneous receive beams covering the lateral and
elevation dimensions. At the date of this proposal not all
FPGA code has been written or evaluated.

[0006] Up to 18 bits of data from a beamformer can be input into
the correlator chip at 60 MHz or more, where the RF line of
up to 8000 points can be cross-correlated against previous
or simultaneously acquired lines, with the Sum of the
Products or the normalized SOP being computed and delivered
at the input frequency (60 MHz). At least 27 of these SOPs
with their associated algorithms may reside within each
chip. The current associated algorithm is peak detection to
1/256 of a clock pulse (complete system evaluation is
required to see what level of accuracy is practically
attainable in a real-world noisy environment). Anticipated
medical algorithms are normalization, stress and
elasticity.

[0007] Earlier this year, a realization led to what we call the
Winprobe Advanced Correlation Algorithm (WACA), which
renders all previous work obsolete by reducing the cost of
cross-correlation by a factor of 30.

[0008] The current state of the published art is shown in Figure 1,
from the work of Duke University, where the ensemble is
tracked using 2:1 parallel receive: "an ensemble of two
beam line sets is acquired along a given steering
direction. A kernel segment from beam 1 (K1) is tracked
within the search segments in beam 1 (S1) and beam 2 (S2)
using a SAD algorithm."



Brief Description of Drawings 

[0009] Figure 1 shows the state of published art in a schematic 
diagram showing an ensemble of two beam line sets ac- 
quired along a given steering direction. 

[0010] Figure 2 shows a schematic diagram of a conventional
cross-correlation engine with two beams: T0 and T1.

[0011] Figure 3 shows a schematic diagram of the present
invention using three cross-correlator engines.

[0012] Figure 4 shows a schematic diagram of the nine-beam
cross-correlations and the 27 CC magnitudes of the target.

Detailed Description

[0013] The invention is an ultrasonic scanning apparatus for
imaging that allows the three dimensional displacement
vector of a backscatterer imaging device to be estimated
from the backscatter from partially overlapping beams that
are cross-correlated in multiple cross-correlators
instantiated in a field programmable gate array or
application specific integrated circuit. It is especially
useful in medical imaging. The cross-correlation may be
formed by an algorithm wherein the elements (data points)
of the kernel in one beam are used with the elements (data
points) from the kernel in a second beam in such a fashion
that the kernels are not shifted over each other but are
shifted together on each clock pulse.

[0014] The Sum of the Absolute Differences (SAD) algorithm is
chosen because it is far less computationally intensive
than a true cross-correlation (normalized SOP). Most state
of the art correlators use SAD for this reason, but it has
significant compromises. As a best match is approached,
the SAD tends to zero, so when the values around the match
are used to interpolate sub-clock match locations, the
statistical data is poor. In an SOP algorithm, the match is
at the peak of a parabola where, for example, a kernel of
32 samplings of 16 bits per sampling at half average
intensity has a parabola peak value of 32 billion, which
provides excellent statistics for sub-clock match
estimation.
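
The contrast between the two measures at the point of best
match can be illustrated with a minimal sketch. The function
names and the test kernel are ours and purely illustrative;
in particular, taking "half average intensity" to mean half
of 16-bit full scale (2**15) is an assumption.

```python
def sad(kernel, segment):
    """Sum of the Absolute Differences: tends to zero at the best match."""
    return sum(abs(a - b) for a, b in zip(kernel, segment))


def sop(kernel, segment):
    """Sum of the Products (not normalized): largest at the best match."""
    return sum(a * b for a, b in zip(kernel, segment))


# Hypothetical kernel: 32 samplings, each at half of 16-bit full scale.
KERNEL_LENGTH = 32
HALF_SCALE = 2 ** 15
kernel = [HALF_SCALE] * KERNEL_LENGTH

sad_at_match = sad(kernel, kernel)  # 0
sop_at_match = sop(kernel, kernel)  # 32 * 2**30 = 34,359,738,368
```

Under these assumptions the SOP at the match is 32 x 2**30,
the "32 billion" of the text, while the SAD at the match is
zero and so carries little magnitude for interpolation.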

[0015] Making an SOP engine of kernel length 32 requires
thirty-two 16-bit fast multipliers that feed an adder
funnel.

[0016] In the field of ultrasound cross-correlation imaging in the
human body, the two beams that are cross-correlated may be
acquired with less than 200 microseconds temporal
separation. A volume of blood or tissue moving at 10 cm/sec
would move 20 microns in 200 us. This 20 microns is
approximately the distance corresponding to one beamformer
ADC clock sampling interval at 40 MHz. A few elements of
the human body move faster than 10 cm/sec and we will show
later how the faster velocities may be simply accommodated.
Initially, we may assume that under the above conditions,
the shift or best kernel match will occur within plus or
minus 1 clock shift. Thus, if we correlate the kernel of
line t0 against the range in line t1 and restrict the
search range to one point behind and one in front, the
process is classic correlation. We may now process the
three correlations simultaneously and an extraordinary
simplification occurs.
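
The arithmetic behind the one-clock assumption can be
checked with a short sketch. The function names are
illustrative, and the default of 20 microns per clock is
simply the figure quoted above for 40 MHz sampling, not a
measured value.

```python
def displacement_microns(velocity_m_per_s, interval_s):
    """Distance moved between two beam acquisitions, in microns."""
    return velocity_m_per_s * interval_s * 1e6


def shift_in_clocks(velocity_m_per_s, interval_s, microns_per_clock=20.0):
    """Same displacement expressed in ADC clock samplings.

    microns_per_clock defaults to the 20 micron figure quoted in the
    text for 40 MHz sampling (an assumption carried over, not derived).
    """
    return displacement_microns(velocity_m_per_s, interval_s) / microns_per_clock


# 10 cm/sec observed over 200 us: about 20 microns, about 1 clock.
slow_flow_microns = displacement_microns(0.10, 200e-6)
slow_flow_clocks = shift_in_clocks(0.10, 200e-6)
```
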

[0017] The Advanced Winprobe Cross-Correlator engine (AWCC)
concept works initially on the assumption that many
cross-correlators are available, so the fastest way to
search for a point of best match is to use, in the simplest
case, 3 simultaneous cross-correlators to sweep down the
acoustic line, producing correlation results at every step.
One correlator is set in front, one at the correlated point
and one behind. The three SOP results (either normalized or
not) are sent at every clock to a peak detector algorithm
that will output the displacement and magnitude for every
clock.
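
One way such a peak detector might work is sketched below.
It assumes a three-point parabolic fit, with the "behind"
correlator at an offset of -1 clock and the "in front"
correlator at +1 clock. Both the fit and the sign convention
are our assumptions for illustration, and the function name
is hypothetical.

```python
def detect_peak(behind, centre, front):
    """Fit a parabola through three SOP values at offsets -1, 0, +1.

    Returns (displacement_in_clocks, peak_magnitude), or None when the
    centre value is not the largest of the three, i.e. no peak lies
    between the outer correlators.
    """
    if centre < behind or centre < front:
        return None
    curvature = behind - 2.0 * centre + front
    if curvature == 0:
        return 0.0, float(centre)  # all three equal: no measurable shift
    displacement = 0.5 * (behind - front) / curvature
    magnitude = centre - 0.25 * (behind - front) * displacement
    return displacement, magnitude


# Samples of y = 100 - (x - 0.25)**2 taken at x = -1, 0 and +1.
shift, peak = detect_peak(98.4375, 99.9375, 99.4375)
```

For the sample values shown, the fit recovers a shift of
0.25 clock and a peak magnitude of 100.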

[0018] There is now a significant computational advantage for
each SOP engine. As the stream of data points from the
kernel in beam t0 is multiplied by the data points from the
kernel in beam t1, the kernels are not shifted over each
other but are shifted together on each clock pulse, so all
the multiplications except the new one are already done.
Thus only one multiplier is needed, and the kernel of
products is kept as a sum where the new product is added
and the oldest is subtracted on each clock. The smallest
FPGA (XC2V2000) could now support 56 cross-correlators in
the one chip. A further anticipated advantage of this
design is that the kernel can be any length, and this
length can be a variable able to be changed in a few
microseconds. A short length (~20 samples) would be
expected to perform better than a longer one if any
velocity gradients are present. As implementing a triple
cross-correlator is not a burden, multiple lengths could be
run simultaneously, and any loss of correlation due to a
velocity gradient will be estimated from the loss of
magnitude output by the peak detector. The internal speed
of the FPGA allows the correlation function to run at
180 MHz while the beamformer will, in most cases, provide
data at less than 60 MHz, allowing each correlator to be
run three times with different kernel lengths, giving
insights into any cause of de-correlation. Preliminary
design has also begun on an interpolation engine so the RF
data can be re-sampled at slightly higher and lower
frequencies. These re-sampled RF vectors will then be
applied to the AWCC engine to estimate the strain in the
kernel.
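
A software sketch of one such SOP engine follows. It models
only the arithmetic (one multiply, one add and one subtract
per clock), not the FPGA implementation. The function names
are ours, and a deque stands in for the hardware shift
register.

```python
from collections import deque


def running_sop(line_a, line_b, kernel_length):
    """Yield the SOP of every kernel position, one multiply per sample."""
    products = deque()  # stands in for the hardware shift register
    total = 0
    for a, b in zip(line_a, line_b):
        product = a * b  # the only multiplication on this clock
        products.append(product)
        total += product  # add the newest product
        if len(products) > kernel_length:
            total -= products.popleft()  # subtract the oldest product
        if len(products) == kernel_length:
            yield total


def direct_sop(line_a, line_b, kernel_length):
    """Reference: recompute every kernel from scratch."""
    n = min(len(line_a), len(line_b))
    return [
        sum(line_a[i + k] * line_b[i + k] for k in range(kernel_length))
        for i in range(n - kernel_length + 1)
    ]
```

Because the kernel length appears only as the depth of the
stored products, it can be changed without altering the
per-clock work, which is consistent with the variable kernel
length anticipated above.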

[0019] The algorithms described above have found a
cross-correlation on every clock, that is, the shift of the
voxel down the beam (~20 microns). We could not possibly
display this amount of data and, as a voxel will probably
comprise 32 digitizations, we can use this over-sampling of
the correlators to reduce noise in an elegant averaging
algorithm.

[0020] For the case where the region contains elements moving
faster than 10 cm/second: a) the forward and back
cross-correlators could be set several clock pulses apart;
b) the sampling frequency could be reduced by discarding
every second digitization; c) additional correlators could
be added at +/-2, 3 and 4 clocks and multiple peak
detectors used. Thus, for blood flow in the ascending aorta
at three meters per second and a 40 MHz clock, the
separation would be +/- 12 clocks. This over-run condition
is easily detected as no peak is detected between the
in-front and behind correlator values. Running multiple
sets of triple correlators will assist in removing any
false peaks that could be present in one set.

[0021] The task is defined as cross-correlating the backscatter
signature from every voxel (target) in a center beam at
time t0 with the 27 voxels that surround the target from
the nine beams at time t1. This will require 27
cross-correlators, each in the advanced correlator design
consisting of three cross-correlator engines, plus any
engines in the axial direction required to account for
velocities faster than 10 cm/sec if we program all the
offsets to one clock for accuracy. A typical over-redundant
configuration would be 99 engines. Each FPGA will support
56 engines easily and there are 10 FPGAs on the ten boards
devoted to this task.

[0022] Additional engines would, in some circumstances, be used 
to create the next t line from the advancing edge in the 
lateral direction for the next beam to increase the frame 
rate to the full acquisition speed of the beamformer. 

[0023] Upon acquisition of the initial nine beams at t0, the
cross-correlations would be performed and, though no flow
would be obtained, the 27 correlation magnitudes would be
stored for comparison with the 27 CC magnitudes of the
target at t0 against the neighborhood at t1 to estimate the
decorrelation rate, which could be due to turbulence or
velocity gradients and require the kernels to be shortened,
which eventually could be automatic.

[0024] Let us examine cross-correlation of a line of, say, 1000
elements against another line of 1000 elements with a
kernel of 32 elements.

[0025] Case A: Conventional Correlation with no a priori
assumptions. The thirty-two elements of the kernel in the
line at T0 are multiplied by the first thirty-two elements
of the line at T1 and the SOP is normalized, requiring
thirty-two multiplies. The target is then moved down in the
range of T1 and the process repeated for each of the
1000-32 increments. This process requires about 32,000
multiplies. The kernel is then incremented down one element
and the process is repeated. The correlation of T0 against
T1 thus requires about 32,000,000 multiplies.

[0026] Case B: Conventional Correlation with assumptions of the
possible (small) shift. The thirty-two elements of the
kernel in the line at T0 are multiplied by the first
thirty-two elements of line T1 and the SOP is normalized,
requiring thirty-two multiplies. The target is then moved
down in the range of T1 and the process repeated for each
of the elements in the selected range. Let us assume the
smallest range of three to cover any small forward or
backward movement. This process requires 96 multiplies. The
kernel is then incremented down one element and the process
is repeated. The correlation of T0 against T1 thus requires
96,000 multiplies.

[0027] Case C: Advanced Winprobe Cross-Correlator. The first
element of the kernel in the line at T0 is multiplied by
the first element of line T1 and added to a register. The
second, third and finally the thirty-second elements are
multiplied and added to the register. This is the SOP of
the first available kernel. Then, the thirty-third elements
of both T0 and T1 are multiplied and the product added to
the register, and the product of the first elements of T0
and T1 is subtracted from the register. This is the second
SOP. Then, the thirty-fourth and ongoing SOPs are found
until the end of the lines is reached. In this way, the SOP
of two lines is reached with only 1000 multiplies. This in
itself only tells us how good the match was between line T0
and line T1 with no offset. Now if we did the process with
T1 offset from T0 by plus one element and again with T1
offset by minus one element, we could estimate the shift of
all the elements in line T0 to line T1. This is the same
result as Cases A and B but with only 3000 multiplies. If
the three cross-correlations are performed simultaneously,
the peak of the shift can be estimated on each shift and
the SOPs discarded, further reducing the circuitry and
giving real time cross-correlation. One multiplier, a shift
register, an adder and a subtractor form one
cross-correlator, which may be budgeted as 75,000 + 512 +
100 + 100 gates, whereas the traditional correlator
requires 32 multipliers and 63 adders, which requires
800,000 + 3100 gates.
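
The multiply counts quoted for Cases A, B and C can be
reproduced with the short calculation below. The constant
names are ours, and Case A follows the text in rounding the
1000-32 increments up to 1000.

```python
LINE_LENGTH = 1000    # elements per line
KERNEL_LENGTH = 32    # elements per kernel
SEARCH_RANGE = 3      # offsets searched in Cases B and C: -1, 0, +1

# Case A: every kernel position is tried against every position in
# the other line, with a full 32-multiply SOP each time.
case_a = KERNEL_LENGTH * LINE_LENGTH * LINE_LENGTH

# Case B: every kernel position is tried against three offsets only.
case_b = KERNEL_LENGTH * SEARCH_RANGE * LINE_LENGTH

# Case C: one new multiply per element for each of the three offsets.
case_c = SEARCH_RANGE * LINE_LENGTH

# Saving of Case C over Case B; equals the kernel length.
saving_over_case_b = case_b // case_c
```

Under these counts the saving of Case C over Case B equals
the kernel length of 32, in line with the factor of about 30
stated earlier.
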

[0028] The instant invention has been shown and described
herein in what is considered to be the most practical and
preferred embodiment. It is recognized, however, that
departures may be made therefrom within the scope of the
invention and that obvious modifications will occur to a
person skilled in the art.



